Avoid retaining saved tensors in fused norm custom ops under no_grad#2012

Open
LeSingh1 wants to merge 1 commit into NVIDIA:master from LeSingh1:fix-1999-fused-rmsnorm-nograd-leak

@LeSingh1

Problem

FusedRMSNorm (and the sibling fused layer-norm and RMS-norm custom ops) leaks two CUDA tensors per forward call under torch.no_grad(), as reported in #1999. On the torch.library.custom_op path (PyTorch >= 2.4), the setup_context functions call save_for_backward unconditionally. The saved tensors are retained in autograd metadata that is not released after a no_grad forward, so each call leaks the saved activation and invvar.

Fix

In each affected setup_context (apex/normalization/fused_layer_norm.py), assign the scalar ctx fields first, then return early when torch.is_grad_enabled() is False, skipping save_for_backward. Backward can never run under no_grad, so nothing is lost, and the grad-enabled training path is unchanged.
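The control flow described above can be sketched without PyTorch. This is an illustration only, not the code in apex/normalization/fused_layer_norm.py: `StubCtx`, the input/output tuple layout, and the `grad_enabled` parameter are assumptions made so the snippet runs anywhere. In the real functions the check would be `torch.is_grad_enabled()` and `ctx` would be the autograd context.

```python
class StubCtx:
    """Hypothetical stand-in for the autograd ctx; records what was saved."""

    def __init__(self):
        self.saved = ()

    def save_for_backward(self, *tensors):
        self.saved = tensors


def setup_context(ctx, inputs, output, grad_enabled):
    # Assumed layouts, for illustration only.
    x, weight, normalized_shape, eps = inputs
    y, invvar = output

    # Scalar fields are cheap, so they are assigned unconditionally.
    ctx.normalized_shape = normalized_shape
    ctx.eps = eps

    # Early return: backward cannot run when grad is disabled, so
    # nothing needs to be kept alive for it.
    if not grad_enabled:  # stands in for `not torch.is_grad_enabled()`
        return

    ctx.save_for_backward(x, weight, invvar)


# Strings stand in for tensors here.
ctx = StubCtx()
setup_context(ctx, ("x", "w", (4,), 1e-5), ("y", "invvar"), grad_enabled=False)
print(ctx.saved)  # nothing saved on the grad-disabled path
```

With `grad_enabled=True` the same call saves `x`, `weight`, and `invvar` as before, which is the sense in which the training path is unchanged.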

Testing / verification status

Runtime-unverified: this was developed on a machine without a CUDA GPU, so the leak reproducer was not executed. The change compiles (py_compile) and passes ruff. The reasoning is that skipping the save under no_grad removes the retained references regardless of the exact internal mechanism. I'd appreciate a maintainer with a GPU running the issue's count_cuda_tensors reproducer to confirm delta == 0 before merge; happy to adjust if the leak originates elsewhere.
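For readers without the issue open, the "delta == 0" check follows a common pattern: count live objects of interest, run the forward several times, and count again. The sketch below is a torch-free approximation of that pattern, not the reproducer from #1999. `count_live` and `leak_delta` are hypothetical names, and the predicate is left to the caller.

```python
import gc


def count_live(predicate):
    """Count gc-tracked objects for which predicate(obj) is true."""
    gc.collect()  # drop unreachable objects first so they are not counted
    return sum(1 for obj in gc.get_objects() if predicate(obj))


def leak_delta(fn, predicate, iters=10):
    """Return how many matching objects survive `iters` calls to fn."""
    fn()  # warm-up, so one-time allocations do not show up as a leak
    before = count_live(predicate)
    for _ in range(iters):
        fn()
    return count_live(predicate) - before


# For CUDA tensors the predicate would be something like:
#   lambda o: torch.is_tensor(o) and o.is_cuda
# and fn would run the fused norm forward under torch.no_grad().
```

A delta that grows linearly with `iters` matches the symptom described in the issue; a delta of 0 is what the fix is expected to produce.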

Developed with AI assistance.

Addresses #1999

The custom-op forward path for FusedRMSNorm/FusedLayerNorm registers an autograd
setup_context that unconditionally calls save_for_backward. For torch.library
custom ops these saved tensors are retained in autograd metadata that is not
released after the call returns, so each forward under torch.no_grad() leaks the
saved activation and the invvar tensor (two CUDA tensors per call), accumulating
linearly in long-running inference (issue NVIDIA#1999).

Skip the save_for_backward calls when grad is disabled, since backward can never
run in that case. The grad-enabled training path is unchanged.

Signed-off-by: LeSingh1 <sshaurya914@gmail.com>